A Probabilistic Model of the Categorical Association Between Colors
Abstract
In this paper we describe a non-parametric probabilistic model that can be used to encode relationships in color naming datasets. This model can be used with datasets containing any number of color terms and expressions, as well as terms from multiple languages. Because the model is based on probability theory, we can use classic statistics to compute features of interest to color scientists. In particular, we show that the uniqueness of a color name (color saliency) can be captured using the entropy of the probability distribution. We demonstrate this approach by applying the model to two different datasets: the multi-lingual World Color Survey (WCS), and a database collected via the web by Dolores Labs. We demonstrate how saliency clusters similarly named colors for both datasets, and compare our WCS results to those of Kay and his colleagues. We compare the two datasets to each other by converting them to a common colorspace (IPT).

Introduction

There has been growing interest in how to use color naming data to improve color models. Better color name databases[7, 10, 11, 12, 14, 2] and online naming studies[18, 8] have stimulated recent work. Color naming databases and associated models have been useful in color transfer[5], gamut mapping[19, 20], and methods for specifying or selecting colors in an image[15, 16, 17].

In this paper, we examine how to represent and quantify the association between colors induced by names. Current methods that incorporate naming data represent the category associated with a color using a single name[5, 6], a vector[19], or a set of fuzzy logic memberships[1, 2, 17]. We present a probabilistic framework for working with colors. We define the categorical association of a color $c$ as a conditional probability $P(C \mid c)$ over colors $C$ in the color space $\mathcal{C}$. For a color $c$, the probability $P(C \mid c)$ represents how likely other colors in the space $\mathcal{C}$ are to be assigned the same linguistic label as $c$.
Our choice of a probability distribution over colors is motivated by three design goals that current approaches do not meet. (1) Our approach can incorporate categorical effects from any number of color words, expressions involving multiple words, and different languages. (2) Our framework is based on a non-parametric model which can capture the differences in color name distributions, such as “yellow” having a narrow focus and “green” having a wide distribution[21]. (3) Embedding our representation in a probabilistic framework enables us to apply a wide array of statistical and probabilistic tools to further analyze and study the effect of categories on colors.

We implement our model on two datasets. We extract color naming data from six languages in the World Color Survey, which contains naming information for 330 colors on the surface of the Munsell solid[7]. We also investigate online naming data collected by DoloresLabs, which contains names given to 10,000 randomly sampled colors in the RGB cube[8]. Our framework can incorporate cross-linguistic data and combine contributions from color words with similar meanings. We introduce the concept of salient colors based on the statistical notion of entropy. Salient colors from our approach show good correspondence with the basic color terms identified by Berlin and Kay[3]. Our approach also reveals two regions in the sRGB cube that are consistently named but do not correspond to typical basic color terms. We qualitatively compare the differences in salient name regions between the World Color Survey and the DoloresLabs datasets.

Motivations and Related Work

The goal of this paper is to present a computational framework for modeling color categories derived from experimental data. Our framework is motivated by three issues that are at best partially addressed in the current literature.

1. We would like a framework that can include all possible words for describing a color and not be limited to a predefined list of terms.
2. We would like a non-parametric model capable of capturing the details in categorical association while remaining robust to noise in the naming dataset.
3. We would like a framework that can support a rich set of computational and mathematical operations, so that more in-depth studies of categorical effects can be built on the framework. In particular, our approach is grounded in probability theory.

The first issue addresses how to account for the many potential expressions for describing a color. In 1969, Berlin and Kay defined color words as basic color terms if their meanings cannot be derived from other words, and proposed that there are a total of eleven basic color terms. Basic color terms were shown to be universal across languages. While some languages such as English contain all eleven terms, others may have developed only a subset of the words[3]. Subsequent studies confirmed that basic color terms are the words with the highest consensus between speakers[4], but found twelve basic color terms in Russian, contradicting the limit on the number of terms[22]. Kay and McDaniel hypothesized that as languages evolve, some individuals may consider additional words such as aqua/turquoise (green and blue), chartreuse/lime (yellow and green), and maroon/burgundy (red and black) as basic color terms[9]. Many existing methods assume eleven, or some other fixed number of, color categories and cannot process the full set of responses from recent surveys such as the HP Labs Multilingual Naming Experiment[18] and the DoloresLabs Naming Dataset[8], which have hundreds of color words. Chang et al.’s category-preserving color transfer algorithm defines eleven convex regions in the color space corresponding to the basic color terms[5].
Motomura’s categorical color mapping algorithm maps foci of the eight chromatic basic color terms between the source and target gamuts[19]. Moroney’s system for translating colors to names operates on the n most frequently used color words. We want a framework where all words are included and contributions from words with similar cognitive concepts, such as “maroon” and “burgundy”, are combined based on their similarity.

Secondly, color names exhibit different naming distributions. Colors such as “red” and “yellow” are known to have a narrow and well-defined center, while colors such as “green” and “blue” are known to span a broad range of hues[21]. We want our framework to have the flexibility to capture the details in the distributions while being robust to noise in the data. Current approaches tend to model color categories as a volume in color space, using various parameterized models, or using non-parameterized approaches such as histograms. Methods that partition the color space[12, 5] assume that color names occupy discrete and non-overlapping regions in the color space. Motomura’s gamut-mapping algorithm assumes that each basic term has an ellipsoid-shaped distribution and models the distributions using an 81-parameter covariance matrix[19]. Benavente models the color naming space using a set of 6-parameter SigmoidGaussian distributions[1]. One advantage of parameterized models is that they are constructed from a small number of parameters which can be estimated accurately. In his adaptive lexical classification system, Moroney proposes an alternative implementation in which color names are represented as non-parametric histograms[16]. While histograms can capture any shape of distribution, Moroney reported noise in the data due to the limited number of data points and suggested that smoothing operators or hedging be applied to post-process the histograms.¹

Finally, we would like a framework capable of supporting a rich set of computational and mathematical tools.
Instead of being merely a representation, the framework should allow us to perform further computation and analysis on how categories affect the way we associate colors. Treating the association between colors as a probability distribution positions our framework within the well-studied domain of probability theory.

Methodology

Colors and Color Words

A naming dataset consists of a list of responses in the form of “color”-“color word” pairs that record the words used to describe a color. A “color” refers to the stimulus shown to a respondent and varies between datasets, from Munsell color chips viewed under controlled lighting to rectangles of color displayed on uncalibrated monitors. Unconstrained surveys allow respondents to use any expression, whereas constrained surveys ask respondents to choose from a predefined list of words. An unconstrained color expression could include, e.g., “granny smith apple green”, “light robin’s egg pastel blue”, or “mix all the paint together”. In practice, most expressions recorded in unconstrained surveys consist of a single word or a simple set of words such as “blue” or “bluish green”. We will use the term “color words” from this point on, even though it could refer to any possible expression for describing a color.

¹ We should emphasize that our application differs from Moroney’s in that his work is on modeling the distribution of color names, while our work is on modeling the association between colors due to naming effects.

A naming dataset can be tabulated using a word count table, where the list of all colors presented in the survey is displayed along the columns and the list of all color words recorded is displayed along the rows. Each entry in the table indicates the number of times the corresponding color word is used to describe the corresponding color. Depending on the nature of the naming dataset, the density of the word count table may vary.
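The tabulation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_word_count_table` and `density`, the nested-dictionary layout, and the toy responses (`chip_A`, `chip_B`, `chip_C`) are all hypothetical.

```python
from collections import Counter


def build_word_count_table(responses):
    """Tabulate (color, word) responses into a word count table.

    Returns (colors, words, table), where table[w][c] is the number of
    times word w was used to describe color c. Rows are words and
    columns are colors, matching the layout described in the text.
    """
    counts = Counter((word, color) for color, word in responses)
    colors = sorted({color for color, _ in responses})
    words = sorted({word for _, word in responses})
    table = {w: {c: counts[(w, c)] for c in colors} for w in words}
    return colors, words, table


def density(table):
    """Fraction of table entries that are non-zero."""
    cells = [n for row in table.values() for n in row.values()]
    return sum(1 for n in cells if n > 0) / len(cells)


# Hypothetical toy responses; a real survey would supply these pairs.
responses = [
    ("chip_A", "blue"), ("chip_A", "blue"), ("chip_A", "bluish green"),
    ("chip_B", "green"), ("chip_B", "bluish green"),
    ("chip_C", "red"),
]
colors, words, T = build_word_count_table(responses)
print(T["blue"]["chip_A"])  # 2
print(density(T))           # 5 non-zero entries out of 12
```

A dense nested dictionary is used here only for readability. For a table as sparse as the ones discussed next, a sparse representation that stores only the non-zero counts would likely be a better fit.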
The World Color Survey (WCS)[7] is cross-linguistic and unconstrained, and collects naming data on a set of 330 colors. The word count table for the WCS consists of 2300 rows by 330 columns, with 20% non-zero entries. In comparison, the DoloresLabs color name dataset[8], while also unconstrained, uses 10,000 randomly sampled colors. A total of 1966 expressions were recorded, creating a 1966-by-10,000 table that contains non-zero values for only 0.05% of the entries.

We use $\mathcal{C}$ to denote all colors in the color space. For certain datasets such as the World Color Survey, where naming data is collected on the surface of the Munsell solid, $\mathcal{C}$ is a two-dimensional surface in the color space. We use $\mathcal{W}$ to denote the set of color words in a dataset.

We define two random variables $C$ and $W$ in our framework. $C$ is a random variable that takes on different colors $c \in \mathcal{C}$. The probability that $C$ takes on the value $c$ is $P(C = c)$. $W$ takes on values over the set of color words $\mathcal{W}$. We use $T$ to refer to the word count table; $T(w,c)$ is the number of times the word $w$ was used to describe the color $c$.

The first relationship of interest is the conditional probability $P(W \mid c)$. Given a color $c$, $P(W \mid c)$ refers to the frequency of color words $W$ being used to describe $c$, and can be computed by selecting the column in the word count table corresponding to $c$ and normalizing the column. This produces a probability distribution over the set of possible color words that sums to 1, with each word weighted in proportion to its frequency of use.

$$P(w \mid c) = \frac{T(w,c)}{\sum_{w'} T(w',c)} \tag{1}$$

The second relationship of interest is the conditional probability $P(C \mid W = w)$. Given a color word $w$, $P(C \mid w)$ describes the likelihood of colors $C$ being referred to by the word $w$. This probability can be computed by selecting the row in the word count table corresponding to $w$ and normalizing the row.
$$P(c \mid w) = \frac{T(w,c)}{\sum_{c'} T(w,c')} \tag{2}$$

Categorical Association of a Color

We now compute a conditional probability $P(C \mid c)$ for each color $c$ that summarizes how color $c$ is associated with all other colors $C$ due to categorical effects. This distribution describes, given a color $c$, the likelihood that colors $C$ in the space $\mathcal{C}$ are given the same name as $c$. We compute this distribution as follows:

1. For each color word $w$, we compute the conditional probability $P(C \mid w)$. This distribution describes, if a color word $w$ is used, the likely colors that $w$ is referring to.
2. We compute the conditional probability $P(W \mid c) = (P(w_1 \mid c), P(w_2 \mid c), \cdots)$. The color $c$ may be associated

[Figure: panels at Brightness = 0.2, 0.3, 0.4, and 0.5]
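Equations (1) and (2) amount to normalizing one column or one row of the word count table. The following is a minimal sketch of that normalization, assuming a nested-dictionary table indexed as `T[w][c]`; the function names, the toy counts, and the choice to raise an error on an all-zero row or column are illustrative assumptions rather than details from the paper.

```python
def p_word_given_color(T, c):
    """Equation (1): normalize the column of T for color c.

    Returns a dict mapping each word w to P(w | c).
    """
    total = sum(row[c] for row in T.values())
    if total == 0:
        # Assumption: a color with no responses has no defined distribution.
        raise ValueError(f"no responses recorded for color {c!r}")
    return {w: row[c] / total for w, row in T.items()}


def p_color_given_word(T, w):
    """Equation (2): normalize the row of T for word w.

    Returns a dict mapping each color c to P(c | w).
    """
    row = T[w]
    total = sum(row.values())
    if total == 0:
        # Assumption: a word that was never used has no defined distribution.
        raise ValueError(f"word {w!r} was never used")
    return {c: n / total for c, n in row.items()}


# Hypothetical toy word count table: T[w][c] = number of times w described c.
T = {
    "blue":  {"c1": 3, "c2": 1, "c3": 0},
    "green": {"c1": 1, "c2": 3, "c3": 0},
    "red":   {"c1": 0, "c2": 0, "c3": 4},
}

pw = p_word_given_color(T, "c1")    # distribution over words for color c1
pc = p_color_given_word(T, "blue")  # distribution over colors for "blue"
print(pw["blue"])  # 0.75
print(pc["c2"])    # 0.25
```

Each returned dictionary sums to 1 by construction, which is the property both equations rely on.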